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DISCUSSION: ONE-STEP SPARSE ESTIMATES IN NONCONCAVE 
PENALIZED LIKELIHOOD MODELS 

By Peter Buhlmann and Lukas Meier 
ETH Zurich 

Hui Zou and Runze Li ought to be congratulated for their nice and in- 
teresting work which presents a variety of ideas and insights in statistical 
methodology, computing and asymptotics. 

We agree with them that one- or even multi-step (or -stage) procedures 
are currently among the best for analyzing complex data-sets. The focus of 
our discussion is mainly on high-dimensional problems where p> n: we will 
illustrate, empirically and by describing some theory, that many of the ideas 
from the current paper are very useful for the p^§> n setting as well. 

1. Nonconvex objective function and multi-step convex optimization. The 

paper demonstrates a nice, and in a sense surprising, connection between 
difficult nonconvex optimization and computationally efficient Lasso-type 
methodology which involves one- (or multi-) step convex optimization. The 
SCAD-penalty function [5] has been often criticized from a computational 
point of view as it corresponds to a nonconvex objective function which is 
difficult to minimize; mainly in situations with many covariates, optimizing 
SCAD-penalized likelihood becomes an awkward task. 

The usual way to optimize a SCAD-penalized likelihood is to use a local 
quadratic approximation. Zou and Li show here what happens if one uses a 
local linear approximation instead. In 2001, when Fan and Li [5] proposed 
the SCAD-penalty, it was probably easier to work with a quadratic approx- 
imation. Nowadays, and because of the contribution of the current paper, 
a local linear approximation seems as easy to use, thanks to the homotopy 
method [12] and the LARS algorithm [4]. While the latter is suited for linear 
models, more sophisticated algorithms have been proposed for generalized 
linear models; cf. [6, 8, 13]. 

In addition, and importantly, the local linear approximation yields sparse 
model fits where quite a few or even many of the coefficients in a linear or 
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generalized linear model are zero, that is, the method does variable selec- 
tion. From this point of view, the local linear approximation is often to be 
preferred. In fact, it closely corresponds to the adaptive Lasso [17] which is, 
in our view, very useful for variable selection with Lasso-type technology. 
The rigorous convergence results in Section 2.3 of the paper, with a nice 
ascent property as for the EM-algorithm, are further reasons to favor the 
local linear approximation over the local quadratic versions with its heuris- 
tic rule for ad-hoc thresholding to zero [as described in the paragraph after 
formula (2.4) of the paper]. Finally, in Theorem 2 of the paper, the local 
linear approximation is shown, to yield the best convex majorization of the 
penalty function. 

1.1. Connection to the adaptive Lasso for type 1 penalty functions. Sec- 
tion 4 in the paper distinguishes the cases where the regularization param- 
eter can be separated from the penalty function or not. Type 1 penalty 
functions p\(t) = Xp(t) allow for separation, for example, the Bridge penal- 
ties and the logarithm penalty, but excluding SCAD. From a computational 
point of view, the type 1 penalties are to be preferred because path-following 
algorithms can be used: this is Algorithm 1 in Section 4 of the paper, allow- 
ing to compute the whole regularization path very efficiently In contrast, 
Algorithm 2, which can be used for the one-step SCAD estimator, seems 
much less efficient for approximating the entire regularization paths. 

Question. Why should we use the one-step SCAD estimator? In partic- 
ular, (at least some of the) one-step type 1 penalty estimators, for example, 
the adaptive Lasso as discussed below, have the same asymptotic oracle 
properties, under the same conditions, as the one-step SCAD presented in 
Theorem 3. 

Consider now type 1 penalty functions. Formulae (3.1) and (3.3) of the 
paper describe the one-step estimator based on the local linear approxima- 
tion. In formula (3.1) corresponding to a linear model, we see that only the 
penalty function involves the initial estimator: the estimator can be written 

as 

/3«=argmini||y-X/?|| 2 + nAf>(|/3f \)\(3,\, 

where the penalty is based on re-weighting the ^i-norm (or Lasso-penalty) 
with weights Wj = w(\m |) depending on the initial estimator. Of course, 
the weights also depend on the type 1 penalty function which we aim to 
approximate with the local linear approximation. 
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This is exactly the idea of the adaptive Lasso, recently proposed by [17]. 
We think it is important to emphasize this connection (which is not men- 
tioned at all in the paper) because: (i) it is a simple and very effective idea 
to reduce the bias of the Lasso, see below; (ii) the adaptive Lasso is the- 
oretically supported [17] and enjoys the same oracle result as the one-step 
SCAD described in Theorem 3, and there are theoretical results even for the 
high-dimensional situation where p^>n [7] . 

Regarding issue (i), particularly in cases with many ineffective (or non- 
substantial) covariates, the prediction optimal Lasso typically needs a large 
penalty parameter to get rid of these ineffective covariates. But a large 
penalty parameter implies substantial shrinkage to zero even for the coeffi- 
cients corresponding to the important covariates. Solutions to improve such 
bias problems of the Lasso are based on two-stage procedures, for example, 
the LARS-OLS hybrid [4], the adaptive Lasso [17] or the relaxed Lasso [9]. 

2. Variable selection in the high-dimensional case. For simplicity, con- 
sider a linear model 



as discussed in Section 3.1 of the paper. Our goal is variable selection (which 
is in many applications more relevant than prediction; we will present an 
example from biology in our Section 3.2). In high-dimensional problems 
where p^> n, computational aspects become crucial. Since there are 2 P sub- 
models, we cannot inspect all of them (even when using efficient branch- 
and-bound methods). Ad-hoc methods can be used, but they may be very 
unstable, yielding poor results (e.g., forward variable selection); see [2]. On 
the other hand, having provably correct algorithms or methods, as the one 
in the paper with provable properties, is much more desirable. 

The Lasso and its modifications belong to the latter class of methods. 
Regarding the computational feasibility, for linear models, the complexity 
to compute all sub-models from Lasso is 0(npmm(n,p)) which is linear 
in the dimensionality p if p S> n. An important question is whether such 
computationally efficient estimators have good, provable statistical proper- 
ties. Meinshausen and Biihlmann [10] showed consistency of the Lasso for 
variable selection in the high-dimensional setting where p> n: there is one 
major assumption, the neighborhood stability condition, which was shown 
to be sufficient and "almost" necessary (the wording "almost" refers to a 
numerical value which has to be < 1 for sufficiency and < 1 for necessity). 
Later, the irrepresentable condition has been worked out [16, 17] which is 
easier to interpret but is equivalent to the Meinshausen-Biihlmann assump- 
tion on neighborhood stability. The irrepresentable assumption is restrictive 
and easily fails to hold if the design matrix exhibits a too strong "degree of 




y = X/3 + e 
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linear dependence" (of course, there is always linear dependence if p S> n) 
or a too strong population absolute correlation among the covariates. 

Since the irrepresentable (or neighborhood stability) condition is restric- 
tive, one would like to understand Lasso's behavior under weaker assump- 
tions: recently, consistency results in terms of 

(2) \\P(X)-P\\ q = o P (l) (n->oo), qe{l,2} 

have been achieved; see [11, 14, 15]. The result in (2) has implications for 
variable selection. In case of fixed dimension p, (2) implies if (3j ^ 0, then 
f3j(\) 7^ with probability tending to 1 [because otherwise the convergence 
to zero in (2) would not hold]. That is, 

The Lasso yields a too large model which contains the true model 
with high probability (tending to 1 as n — > oo). 

Under suitable conditions, the statement in (3) also holds in the high- 
dimensional case where p^>n; see [11]. In addition, we point out that 

The prediction optimal (w.r.t. MSE) tuned Lasso contains the true 
model with high probability (tending to 1 as n — > oo). 

This has been proved for simple cases in [10]. 

Putting the two facts (3) and (4) together, we can view the Lasso as an 
excellent and computationally efficient tool for "variable filtering," in the 
sense that the true model is with high probability a subset of the Lasso- 
estimated model. To appreciate the value of such a result, imagine that 
we have p ~ lO'OOO and n ~ 50 (e.g., from microarray data). As the size of 
the Lasso-estimated sub-model is bounded by min(n,p), which equals n if 
p^> n, we pursue an immense dimensionality reduction from p ~ lO'OOO to 
something of the order 50. 

When viewing the Lasso as a variable filtering method, it is clear that 
we want to do an additional step (similar to the one-step estimator in the 
paper) which aims to go from the Lasso-estimated model in the first stage to 
the true model in a second stage. We have already touched upon two-stage 
procedures for addressing the bias problem in Lasso. The main proposals 
are the LARS-OLS hybrid [4], the relaxed Lasso [9] and the adaptive Lasso 
[17] with the Lasso as initial estimator: all of them reduce the bias and, 
in fact, this is what will lead to consistency in variable selection. We think 
(based on empirical evidence) that the latter, which is essentially the one- 
step estimator in the paper when using the Lasso as initial estimator, is a 
very elegant way to address Lasso's overestimation behavior. In addition, 
some theory for consistency in variable selection has been worked out for 
the high-dimensional case [7]. 
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3. Beyond the one-step estimator. For regularization in high-dimensional 
spaces, we may want to use more than one or two regularization parame- 
ters. This can be achieved by pursuing more iterations where every iteration 
involves a separate tuning parameter (and as described below, those param- 
eters are "algorithmically" constrained). We propose here the 

Multi-Step Adaptive Lasso (MSA-LASSO): 

1. Initialize the weights Wj = 1 (j = 1, . . . ,p). 

2. For k = 1,2,..., M, 

use the adaptive Lasso with penalty function 

where A* is the regularization parameter leading to prediction optimal- 

ity Denote the estimator by (3^ = /?( fe )(A* ). In practice, the value A* 
can be chosen via some cross-validation scheme. 
Up-date the weights 

(k) I • _ -, 

Wj ~\(3^(\i k -\\ J , "' ,P ' 

For k = 1 , we do an ordinary Lasso fit and k = 2 corresponds to the adaptive 
Lasso. Note that what is termed "one-step" in the paper corresponds here 
to k = 2. Note that Zou and Li initialize with = (in the terminology of 
MSA-LASSO), yielding the MLE (in step k = 1). We find it more natural, 
and actually essential in the high-dimensional case with p^> n, to initialize 
with the nonzero weights allowing for regularized fitting in step k = 1 . 

We will illustrate below the MSA-LASSO on a small simulated model and 
a real data set from molecular biology. Before, we describe some properties 
of the method which are straightforward to derive. 

Property 1. MSA-LASSO increases the sparsity in every step in terms 
of the ^o- norm ? that is, fewer selected variables in every step. As "heuristics," 
which is derived from the Zou and Li paper, MSA-LASSO is related to 
approximating the nonconvex optimization problem with the log-penalty 
Sj=i log(|/3j |); see the formula appearing just before Proposition 2. 

Property 2. MSA-LASSO can be computed using the LARS algorithm 
for every step. The computational complexity of MSA-LASSO is bounded 
by 0(Mnpmm(n,p)); due to the increase of sparsity, a later step is faster 
to compute than an early one. The computational load is in sharp contrast 
to computing all solutions over all M steps when allowing for any A for each 
Lasso path: this would require many more essential operations. 
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MSA-LASSO is different from the multi-step procedure as analyzed in 
Section 2.3 of the paper: there, the regularization parameter A is fixed. We 
will also discuss below in Tables 1 and 2, for a simulated example, that 
the algorithmic restriction of choosing the regularization parameters in a 
sequentially optimal fashion seems very reasonable. 

3.1. Small simulation study. To illustrate the proposed MSA-LASSO 
method, a small simulation study is carried out. We use a linear model 
as in (1) with covariates X from a multivariate normal distribution with 
correlation matrix Sjj = p\ % ~i\ (for various values of p). The true underly- 
ing parameter vector is of the form j3 = (c, . . . , c, 0, . . . , 0) T with p act nonzero 
entries and c such that the signal-to-noise ratio is 9 (which we find more 
relevant for practical applications than a signal-to-noise ratio of 21.25 as in 
example 1 in the paper). The number of predictors is set to p = 500. We 
choose the number of active variables p ac t € {3, 25}. In each simulation run, 
a training set of size 100 and a validation set of size 50 is used to determine 
the prediction optimal estimator. A total of 100 simulation runs are carried 
out for each parameter setting. 

As performance measures we use the squared error \\(3 — (3\\2 and the 
number of false positives (FP) Y3=\ 0> flj = 0)- Table 1 illustrates the 

results for the case p act = 3. We denote by 1-step (k = 2) the MSA-LASSO 
with k = 2: it equals the adaptive Lasso with Lasso as initial estimator 
and we use the terminology "1-step" to be more consistent with the paper. 
Furthermore, 1-step opt. is the adaptive Lasso with Lasso as initial estimator 
(as for 1-step), but we optimize over a large 2-dimensional grid of the two 
regularization parameters which are involved in the initial Lasso and the 
adaptive Lasso step. Several conclusions can be made. The estimation error 
can be slightly decreased by an additional step if the correlation is not too 
high. More importantly, the number of false positives gets reduced. The 



Table 1 

Results /or p ac t =3 active variables. Average and standard deviations (in parentheses) of 
squared errors and of false positives (FP) 







p = 


p = 0.2 


p = 0.5 


p = 0.8 








Squared 


error 




1-Step (MSA-LASSO k 


= 2) 


0.12 (0.16) 


0.09 (0.09) 


0.11 (0.11) 


0.21 (0.23) 


2-Step (MSA-LASSO k 


= 3) 


0.08 (0.12) 


0.08 (0.10) 


0.09 (0.09) 


0.23 (0.30) 


1-Step opt. 




0.08 (0.09) 


0.09 (0.10) 


0.10 (0.09) 


0.19 (0.17) 








False positives (FP) 




1-Step (MSA-LASSO k 


= 2) 


6.10 (12.07) 


3.82 (6.59) 


3.73 (5.00) 


3.19 (6.23) 


2-Step (MSA-LASSO k 


= 3) 


2.93 (7.53) 


2.19 (6.24) 


1.65 (2.69) 


2.19 (5.09) 


1-Step opt. 




2.92 (7.13) 


3.08 (7.55) 


2.88 (6.00) 


3.36 (5.56) 
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Table 2 

Results for 25 active variables. Average and standard deviations (in parentheses) of 
squared errors and of false positives (FP) 



p = 



p = 0.2 



p = 0.5 



p = 0.8 



1- Step (MSA-LASSO k = 2) 6.15 (1.45) 

2- Step (MSA-LASSO k = 3) 6.25 (1.50) 
1-Step opt. 6.03 (1.49) 

1- Step (MSA-LASSO k = 2) 32.91 (23.66) 

2- Step (MSA-LASSO k = 3) 27.43 (23.54) 
1-Step opt. 31.55 (22.11) 



Squared error 

3.23 (1.06) 1.78 (0.45) 1.57 (0.32) 

3.28 (1.12) 1.92 (0.48) 1.73 (0.35) 

2.95 (1.04) 1.48 (0.49) 1.25 (0.33) 
False positives (FP) 

30.26 (18.16) 17.91 (16.37) 8.42 (9.01) 

25.52 (17.44) 14.60 (15.76) 6.35 (6.73) 

24.55 (15.55) 9.48 (11.37) 2.76 (5.61) 



number of false negatives is zero for this setting (not shown), that is, the 
true variables are always identified. The computational extra effort of the 
1-Step opt. estimator does not pay off in this situation. 

The results for p act = 25 are given in Table 2. There is a slight loss in terms 
of mean squared error (MSE) for this setting when doing an additional step. 
Already, the 1-Step estimator loses compared to the initial Lasso estimator 
(not shown), in terms of MSE; likewise, the 1-Step opt. estimator has worse 
performance than the initial Lasso estimator. However, the number of false 
positives (FP) gets reduced again due to the increased sparsity. In terms of 
FP, the 1-Step opt. estimator seems to perform better for moderate and large 
values of p; but an additional step (k = 4) in MSA-LASSO would improve 
performance with respect to FP as well. 

3.2. Real data example from biology. Reducing the number of false pos- 
itives can be very desirable in biological applications since follow-up experi- 
ments can be costly and laborious. In our experience, it is often appropriate 
to do estimation on the conservative side with a low number of false positives 
because we still see more positives than what can be typically validated in 
a laboratory. 

We illustrate the MSA-LASSO method on a problem of motif regression 
[3] for finding transcription factor binding sites in DNA sequences. [1] con- 
tains a collection of microarray data and a collection of motif candidates for 
yeast. The idea is to predict the gene expression value of a gene based on the 
corresponding motif scores (the information based on the sequence data). 
The data set which we consider consists of n = 2587 gene expression values 
of a heat-shock experiment and p = 666 motif scores. We use a training set 
of size 1300 and a validation set of size 650. The remaining data is used as 
a test-set. 
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The squared prediction error on the test-set E[(y ncw — y ncw )] = ($ — 
P) T T,(0 — j3) + Var(e) remains essentially constant for all estimators [prob- 
ably due to high noise, that is, large value of Var(e)]: 0.6193, 0.6230, 0.6226 
for the Lasso, 1-Step and 2-Step estimator, respectively. But the number of 
selected variables decreases substantially: 

Lasso 1-Step 2-Step 
number of selected variables 91 42 28 

The list of top-candidate motifs gets slightly rearranged between the differ- 
ent estimators. The hope (and in part a verified fact) is that the 1- or 2-Step 
estimator yield more stable lists with fewer false positives. 
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